[Content Understanding] Add Copilot skills for custom-analyzer authoring by chienyuanchang · Pull Request #47218 · Azure/azure-sdk-for-python

chienyuanchang · 2026-05-28T23:40:37Z

Description

Adds two Copilot skills under .github/skills/ in the azure-ai-contentunderstanding package that walk SDK users (and AI coding agents) through creating a custom analyzer end-to-end using the typed ContentUnderstandingClient already shipped in the package.

Zero public SDK API changes. Skills + scripts live under .github/, which is excluded from the PyPI sdist via the existing include-only MANIFEST.in — pip install azure-ai-contentunderstanding is byte-identical for consumers.
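One way to spot-check the sdist claim above is to list the archive members and look for anything under `.github/`. The following is a minimal sketch, not part of this PR; the helper names `sdist_member_names` and `dev_only_members` are hypothetical, and it assumes the usual sdist layout where every member is rooted at a `<name>-<version>/` directory.

```python
import tarfile


def sdist_member_names(sdist_path):
    """Return all member names of a .tar.gz sdist (hypothetical helper)."""
    with tarfile.open(sdist_path, "r:gz") as tar:
        return tar.getnames()


def dev_only_members(member_names, prefix=".github/"):
    """Return members that live under `prefix` once the root directory is stripped.

    Assumes every member is rooted at "<name>-<version>/", as sdists usually are.
    """
    hits = []
    for name in member_names:
        # Drop the leading "<name>-<version>/" component, if there is one.
        relative = name.split("/", 1)[-1]
        if relative.startswith(prefix):
            hits.append(name)
    return hits
```

An empty list from `dev_only_members(sdist_member_names(path))` would be consistent with the skills not shipping to consumers.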

Skill	When to use
`cu-sdk-generate-analyzer`	Single document type per analyzer (e.g. a folder of invoices)
`cu-sdk-generate-analyzer-classify-route`	Multi-doc-type packets (e.g. invoice + bank statement + loan application in one PDF)

Each skill walks: env check → layout extraction → schema authoring (starting from a template) → local validation → analyzer create → batch test → category-aware stdout summary with leaf-level field rollup → optional ephemeral cleanup.

What's in this PR

Path	Kind	Notes
`.github/skills/_shared/schema_validator.py`	New	Pure-JSON validator. No `azure.*` / `requests` / `urllib` imports. Allow-list catches the `prebuilt-documentAnalyzer` typo class before any service call.
`.github/skills/cu-sdk-generate-analyzer/`	New	`SKILL.md` + `scripts/{extract_layout,create_and_test}.py` + `.sh` wrappers + `templates/schema_template.json`
`.github/skills/cu-sdk-generate-analyzer-classify-route/`	New	`SKILL.md` + `scripts/create_and_test_router.py` + `.sh` wrapper + `templates/classifier_template.json`
`.github/skills/cu-sdk-common-knowledge/SKILL.md`	Modified	Added two-stage pipeline rule + classify-and-route rule.
`.github/skills/cu-sdk-sample-run/SKILL.md`	Modified	One-line "next step" hints pointing at the two new skills.
`README.md`	Modified	New rows in the "Available Skills" section.
`.gitignore`	Modified	Comment update only; the existing `.local_only/` rule already covers `.local_only/layout/`, `.local_only/schemas/`, and `.local_only/test_results/` written by the skill scripts.
`tests/test_skills_*.py`	New	19 unit tests (validator purity, classifier wiring, prebuilt passthrough, leaf-row summary, etc.).
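To illustrate the allow-list idea behind the validator, here is a minimal stdlib-only sketch. It is not the contents of `schema_validator.py`: the allow-list values, the `fieldSchema.fields` check, and the function name `validate_schema` are assumptions made for illustration.

```python
# Assumed allow-list for illustration only; the real list lives in schema_validator.py.
ALLOWED_BASE_ANALYZER_IDS = {
    "prebuilt-document",
    "prebuilt-image",
    "prebuilt-audio",
    "prebuilt-video",
}


def validate_schema(schema: dict) -> list:
    """Return a list of human-readable errors; an empty list means the schema passed."""
    errors = []

    base_id = schema.get("baseAnalyzerId")
    if base_id is None:
        errors.append("missing required key 'baseAnalyzerId'")
    elif base_id not in ALLOWED_BASE_ANALYZER_IDS:
        # Catches near-miss IDs such as 'prebuilt-documentAnalyzer' locally.
        errors.append(
            f"unknown baseAnalyzerId {base_id!r}; "
            f"expected one of {sorted(ALLOWED_BASE_ANALYZER_IDS)}"
        )

    # Assumed structural rule: fieldSchema.fields must be a non-empty object.
    field_schema = schema.get("fieldSchema")
    fields = field_schema.get("fields") if isinstance(field_schema, dict) else None
    if not isinstance(fields, dict) or not fields:
        errors.append("fieldSchema.fields must be a non-empty object")

    return errors
```

Because the function only touches plain dictionaries, it can run before any `azure.*` import, which is the property the purity tests are described as checking.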

End-to-end smoke runs

Both skills were executed against samples/sample_files/mixed_financial_docs.pdf (a packet containing invoice + bank statement + loan application). Ephemeral cleanup of all created analyzers confirmed for both runs.

Single analyzer (`cu-sdk-generate-analyzer`)

========================================================================
[SUMMARY]

category: (single)  (1 document)
--------------------------------
  field                                    fill rate   avg conf
  billToAddress                            100.0%      0.922
  customerId                               100.0%      0.953
  customerName                             100.0%      0.903
  dueDate                                  100.0%      0.979
  invoiceDate                              100.0%      0.977
  invoiceNumber                            100.0%      0.956
  lineItems[].amount                       100.0%      0.928
  lineItems[].date                         100.0%      0.902
  lineItems[].description                  100.0%      0.661
  lineItems[].itemCode                     100.0%      0.819
  lineItems[].quantity                     100.0%      0.932
  lineItems[].tax                          100.0%      0.924
  lineItems[].unitPrice                    100.0%      0.931
  poNumber                                 100.0%      0.961
  salesTax                                 100.0%      0.939
  shipToAddress                            100.0%      0.929
  subtotal                                 100.0%      0.917
  totalAmount                              100.0%      0.939
========================================================================

18 leaf rows reported, 100% fill rate across all extracted fields. Lowest confidence 0.661 on lineItems[].description — exactly the "agent reads the summary, proposes targeted v2 edits" loop the skill is designed for.
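The leaf-row table above can be thought of as a flatten-then-aggregate pass. The sketch below shows one way to compute fill rate and average confidence per leaf path. The input shape (plain dicts with `value`, `confidence`, and `items` keys) is a simplified stand-in chosen for illustration, not the SDK's result type, and `leaf_rows` / `summarize` are hypothetical names.

```python
from collections import defaultdict


def leaf_rows(fields, prefix=""):
    """Yield (leaf_path, value, confidence) from a simplified field dict.

    Assumed shape: scalar fields are {"value": ..., "confidence": ...};
    array fields are {"items": [ {sub_field_name: field, ...}, ... ]}.
    """
    for name, field in fields.items():
        path = f"{prefix}{name}"
        if "items" in field:
            for item in field["items"]:
                yield from leaf_rows(item, prefix=f"{path}[].")
        else:
            yield path, field.get("value"), field.get("confidence")


def summarize(documents):
    """Return {leaf_path: (fill_rate_percent, avg_confidence_or_None)}."""
    stats = defaultdict(lambda: {"seen": 0, "filled": 0, "conf": []})
    for fields in documents:
        for path, value, conf in leaf_rows(fields):
            row = stats[path]
            row["seen"] += 1
            if value not in (None, ""):
                row["filled"] += 1
                if conf is not None:
                    row["conf"].append(conf)
    return {
        path: (
            100.0 * row["filled"] / row["seen"],
            sum(row["conf"]) / len(row["conf"]) if row["conf"] else None,
        )
        for path, row in sorted(stats.items())
    }
```

Sorting the rows by average confidence would surface a field like `lineItems[].description` first, which is the kind of signal an agent could use when proposing schema edits.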

Classify-and-route (`cu-sdk-generate-analyzer-classify-route`)

========================================================================
[SUMMARY] (category-aware)

category: bank_statement  (1 segments)
  accountHolder                  100.0%      0.958
  accountNumber                  100.0%      0.962
  statementPeriod                100.0%      0.903

category: invoice  (1 segments)
  invoiceDate                    100.0%      0.979
  invoiceNumber                  100.0%      0.956
  totalAmount                    100.0%      0.872

category: loan_application  (1 segments)
  applicantDateOfBirth           100.0%      0.902
  loanAmountRequested            100.0%      0.886
  signatureDate                  100.0%      0.956
========================================================================

3 categories correctly identified, 9 fields × 100% fill. Per-category denominator verified (each category's fill rate counts only segments classified into that category, not packet-wide total).
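The per-category denominator can be made concrete with a small sketch. The segment shape used here (a dict with `category` and `fields` keys) and the function name `category_fill_rates` are hypothetical; the point is only that each field's fill rate is divided by the number of segments in its own category, not by the number of segments in the packet.

```python
from collections import defaultdict


def category_fill_rates(segments):
    """Return {category: {field: fill_rate_percent}}.

    Assumed segment shape: {"category": str, "fields": {name: value_or_None}}.
    """
    segment_count = defaultdict(int)
    filled = defaultdict(lambda: defaultdict(int))
    for segment in segments:
        category = segment["category"]
        segment_count[category] += 1
        for name, value in segment["fields"].items():
            # Adding 0 still registers the field so an unfilled field reports 0%.
            filled[category][name] += 1 if value is not None else 0
    return {
        category: {
            # Denominator: segments classified into this category only.
            name: 100.0 * count / segment_count[category]
            for name, count in names.items()
        }
        for category, names in filled.items()
    }
```

With a packet-wide denominator, a field that is always present in its one matching segment would be reported at well under 100% whenever the packet contains other document types.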

All SDK Contribution checklist

The pull request does not introduce [breaking changes] — zero public API changes; .github/ is not in the sdist
CHANGELOG is updated for new features, bug fixes or other significant changes. — N/A: no shipped code changed; skills are dev-experience artefacts under .github/ and do not appear in any released wheel. Happy to add a "Other Changes" CHANGELOG entry if reviewers prefer.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message.

Testing Guidelines

Pull request includes test coverage for the included changes. — 19 unit tests in tests/test_skills_*.py (validator purity, classifier wiring, prebuilt-routing passthrough, single-analyzer argparse + invalid-schema-pre-import + leaf-row summary). Helper scripts are thin wrappers over the typed SDK; the underlying API calls are already covered by the SDK's recorded tests, so the manual smoke run captured above covers the end-to-end path.

Verification steps

cd sdk/contentunderstanding/azure-ai-contentunderstanding

# Run the new unit tests (no Azure auth needed)
python -m pytest tests/test_skills_*.py -v
# expected: 19 passed

# Reproduce the single-analyzer smoke run (requires CONTENTUNDERSTANDING_ENDPOINT)
cp .github/skills/cu-sdk-generate-analyzer/templates/schema_template.json schemas/invoice_v1.json
# edit schemas/invoice_v1.json — replace REPLACE: placeholders
python .github/skills/cu-sdk-generate-analyzer/scripts/create_and_test.py \
    --schema schemas/invoice_v1.json \
    --input samples/sample_files/mixed_financial_docs.pdf \
    --output test_results/v1 --ephemeral

Copilot

Pull request overview

Adds two new GitHub Copilot skills under .github/skills/ of the azure-ai-contentunderstanding package that walk users through authoring custom analyzers end-to-end (single-doc-type and classify-and-route variants), plus a shared pure-Python schema validator and 19 unit tests. No public SDK API changes; assets live in .github/ and are excluded from the sdist.

Changes:

New skills cu-sdk-generate-analyzer and cu-sdk-generate-analyzer-classify-route with helper scripts (extract_layout.py, create_and_test.py, create_and_test_router.py), shell wrappers, and JSON templates.
New shared pure-Python validator (_shared/schema_validator.py) that catches baseAnalyzerId typos and structural errors before any service call.
Updates to cu-sdk-common-knowledge/cu-sdk-sample-run SKILL.md and the package README, plus a small testpreparer.py enhancement (create_client_from_credential endpoint trailing-slash normalization) and 3 new test modules.
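The trailing-slash normalization mentioned above can be sketched in a couple of lines. This is an illustration rather than the code in `testpreparer.py`: the function name `normalize_endpoint`, the example host, and the `/analyzers` path used to show the double-slash effect are all hypothetical.

```python
def normalize_endpoint(endpoint: str) -> str:
    """Strip any trailing '/' so later path joins do not produce '//'."""
    return endpoint.rstrip("/")


# Hypothetical endpoint and path, used only to show the effect of the join.
raw = "https://example.cognitiveservices.azure.com/"
joined_raw = raw + "/analyzers"                      # contains '.com//analyzers'
joined_ok = normalize_endpoint(raw) + "/analyzers"   # contains '.com/analyzers'
```

`str.rstrip("/")` removes every trailing slash, so an endpoint that already has none is returned unchanged.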

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.


File	Description
tests/testpreparer.py	Add `create_client_from_credential` override that strips trailing `/` from endpoints.
tests/test_skills_shared_schema_validator.py	Unit tests for the shared validator (purity, accept/reject cases).
tests/test_skills_create_and_test.py	Unit tests for single-analyzer script: `--help`, validator-first behavior, leaf-row summarize.
tests/test_skills_classify_route_router.py	Unit tests for router script: per-category denominator, alias wiring, prebuilt passthrough.
README.md	Add rows for the two new skills in the Available Skills table.
.gitignore	Update comment on `.local_only/` (does not actually add new ignores, contrary to PR description).
.github/skills/cu-sdk-setup/SKILL.md	Note that step numbering is referenced by the new skills.
.github/skills/cu-sdk-sample-run/SKILL.md	Add "next step" hints linking to the two new skills.
.github/skills/cu-sdk-common-knowledge/SKILL.md	Add two-stage pipeline rule, `baseAnalyzerId` table, classify-and-route rules.
.github/skills/_shared/README.md	Document the `_shared/` library directory rules.
.github/skills/_shared/schema_validator.py	Pure-stdlib validator for analyzer schemas.
.github/skills/cu-sdk-generate-analyzer/SKILL.md	New single-doc-type analyzer authoring skill.
.github/skills/cu-sdk-generate-analyzer/scripts/README.md	Quick reference for the two helper scripts.
.github/skills/cu-sdk-generate-analyzer/scripts/extract_layout.{py,sh}	Stage 1 layout extraction helper.
.github/skills/cu-sdk-generate-analyzer/scripts/create_and_test.{py,sh}	Stage 2 validate→create→batch-test→summarize helper.
.github/skills/cu-sdk-generate-analyzer/templates/schema_template.json	Starter single-type schema template.
.github/skills/cu-sdk-generate-analyzer-classify-route/SKILL.md	New classify-and-route authoring skill (contains a duplicated `[ASK USER]` block).
.github/skills/cu-sdk-generate-analyzer-classify-route/scripts/create_and_test_router.{py,sh}	Router script: validate, create inner+outer analyzers, batch test, category-aware summary.
.github/skills/cu-sdk-generate-analyzer-classify-route/templates/classifier_template.json	Starter outer-classifier schema template.

yungshinlintw · 2026-06-17T21:27:54Z

@@ -695,6 +697,8 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
 [cu_sdk_setup_skill]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/contentunderstanding/azure-ai-contentunderstanding/.github/skills/cu-sdk-setup
 [cu_sdk_sample_run_skill]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/contentunderstanding/azure-ai-contentunderstanding/.github/skills/cu-sdk-sample-run


Please add a "What's New" or "Changelog" section and link to the CHANGELOG. This helps users track new additions easily.

We should also update the CHANGELOG for these new skills.

Will we release this as a new version, or do we just add it to the CHANGELOG and wait for the next release?

yungshinlintw · 2026-06-17T21:55:06Z

+When using `config.contentCategories` to classify and route mixed-document
+packets:
+
+1. **Category descriptions follow the same text-anchored rule** as field


The category descriptions should be generic enough and should not force the classification to rely on hardcoded values unless those values are important information.

Updated the example JSON in classify-route Step 3 to use generic, content-kind descriptions and added a callout block right after the example warning the agent not to copy verbatim.
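As a rough illustration of what "generic, content-kind descriptions" might look like, the sketch below pairs a hypothetical category map with a crude check for hardcoded values. Everything except the `contentCategories` key is an assumption: the `description` / `analyzerId` key names, the analyzer IDs, the wording of the descriptions, and the `has_hardcoded_value` heuristic are illustrative and are not taken from the skill's template.

```python
import re

# Hypothetical category map; key names other than "contentCategories" are assumed.
classifier_config = {
    "contentCategories": {
        "invoice": {
            "description": "A bill from a seller to a buyer listing goods or services and the amount due.",
            "analyzerId": "my_invoice_analyzer",  # hypothetical ID
        },
        "bank_statement": {
            "description": "A periodic account summary from a bank listing balances and transactions.",
            "analyzerId": "my_bank_statement_analyzer",  # hypothetical ID
        },
    }
}


def has_hardcoded_value(description: str) -> bool:
    """Crude heuristic: flag runs of three or more digits (IDs, amounts, years)."""
    return re.search(r"\d{3,}", description) is not None


flagged = [
    name
    for name, category in classifier_config["contentCategories"].items()
    if has_hardcoded_value(category["description"])
]
```

A description such as "Invoice INV-10042 from Contoso Ltd." would be flagged by this heuristic, since it ties the category to one specific document rather than to a kind of content.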

…r-sklls
# Conflicts:
#   sdk/contentunderstanding/azure-ai-contentunderstanding/.github/skills/cu-sdk-common-knowledge/SKILL.md
#   sdk/contentunderstanding/azure-ai-contentunderstanding/CHANGELOG.md

Sphinx and link-check both flag the relative "samples/sample_create_classifier.py" link in the What's New section. Use the reference-style absolute URL pattern matching the other sample links in this README.

yungshinlintw · 2026-06-20T01:42:42Z

+> Repeat until all key fields reach **fill rate ≥ 80%** and
+> **avg confidence ≥ 0.85**, or the user is satisfied.
+>
+> Stop and report to the user when any of:


I tried this and found that the skill should report the following at the end, so that the user knows what to do next:

The final analyzer ID

The file path of the final iteration of the schema file

A pointer to the SDK samples for creating and using custom analyzers
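The stop thresholds quoted in the diff and the reporting items requested here could be combined roughly as follows. This is a sketch under assumptions, not the skill's implementation: the function names, the `rows` shape (`{field: (fill_rate_percent, avg_confidence)}`), and the report wording are hypothetical, and the sample reference is passed in by the caller rather than hardcoded.

```python
def fields_below_target(rows, min_fill=80.0, min_conf=0.85):
    """Return field names that miss either threshold.

    rows: {field_name: (fill_rate_percent, avg_confidence)} (assumed shape).
    """
    return [
        name
        for name, (fill, conf) in rows.items()
        if fill < min_fill or conf < min_conf
    ]


def final_report(analyzer_id, schema_path, samples_ref):
    """Format the three items the reviewer asked the skill to report."""
    return "\n".join(
        [
            f"Final analyzer ID: {analyzer_id}",
            f"Final schema file: {schema_path}",
            f"SDK samples for creating and using custom analyzers: {samples_ref}",
        ]
    )


# Values taken from the single-analyzer summary earlier in this PR.
rows = {
    "invoiceNumber": (100.0, 0.956),
    "lineItems[].description": (100.0, 0.661),
}
needs_work = fields_below_target(rows)
```

An empty `needs_work` list would correspond to the "all key fields reach the targets" stop condition, at which point the report can be printed.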

initial version

5ac4860

github-actions Bot added the Cognitive - Content Understanding label May 28, 2026

chienyuanchang added 14 commits May 28, 2026 18:54

clean and improve classifier

99eae81

clean up

8fadeba

use .local_only as output folder

3639b5d

use our skills to set up env

c59c801

improve by learning from our samples

5714cd6

Merge branch 'main' into cu-sdk/custom-analyzer-sklls

9296d68

fix spell issue

75ae260

fix slash

4adf258

Merge branch 'main' into cu-sdk/custom-analyzer-sklls

7490180

fix slash and spell

e25cd0e

[temp fix] fix endpoint

c9b9dda

revert patch fix

15fe5d0

fix it by rstrip

aaef159

Merge branch 'main' into cu-sdk/custom-analyzer-sklls

9485dae

chienyuanchang marked this pull request as ready for review June 1, 2026 20:58

chienyuanchang requested review from bojunehsu and changjian-wang as code owners June 1, 2026 20:58

Copilot AI review requested due to automatic review settings June 1, 2026 20:58

chienyuanchang requested a review from yungshinlintw as a code owner June 1, 2026 20:58

Copilot started reviewing on behalf of chienyuanchang June 1, 2026 20:58

Copilot AI reviewed Jun 1, 2026

Comment thread .../azure-ai-contentunderstanding/.github/skills/cu-sdk-author-analyzer-classify-route/SKILL.md

Comment thread sdk/contentunderstanding/azure-ai-contentunderstanding/.gitignore Outdated

chienyuanchang added 2 commits June 1, 2026 14:32

remove dup section

75fb66c

Merge branch 'main' into cu-sdk/custom-analyzer-sklls

54b5a77